Summary

A spatial auditory display was used to convolve speech 
stimuli, consisting of 130 different call signs used in the 
communications protocol of NASA’s John F. Kennedy 
Space Center, to different virtual auditory positions. An 
adaptive staircase method was used to determine 
intelligibility levels of the signal against diotic speech 
babble, with spatial positions at 30° azimuth increments. 
Non-individualized, minimum-phase approximations of 
head-related transfer functions were used. The results 
showed a maximal intelligibility improvement of about 
6 dB when the signal was spatialized to 60° or 90° 
azimuth positions. 

1. Introduction 

1.1 Application to NASA Communication Systems 

During fiscal year 1992, NASA Director’s Discretionary 
Funding was received from Ames Research Center 
(ARC) and John F. Kennedy Space Center (KSC) by 
Drs. E. M. Wenzel and D. R. Begault, to develop a four 
channel spatial auditory display for application to 
multiple channel speech communication systems in use at 
KSC. A previously specified design (Begault and Wenzel, 
1990; Begault, 1992a) was used to fabricate a prototype 
device, which was completed in February 1993. This 
prototype places four different communication channels 
in virtual auditory positions about the listener by digitally 
filtering each input channel with binaural head-related 
transfer function (HRTF) data. Listening over headphones,
one has a spatial sense of each channel originating
from a unique position outside the head; i.e., as if
four people were standing about you, speaking from
different directions.

Input channels to the spatial auditory display can be 
assigned to any position because the design uses four 
removable EPROMs,¹ with each EPROM corresponding
to a particular target position. The EPROMs themselves 
can contain a binaural HRTF for any given position and 
measured ear. Hence, an important research question is to 
determine which four positions would be optimal for 
speech intelligibility of multiple sound sources. To begin 
to answer this question, the current investigation focused 
on what single spatialized azimuth position yielded 
maximal intelligibility against noise. This was accomplished
by measuring intelligibility thresholds at 30°
azimuth increments. Intelligibility is defined here as
correct identification of a spatialized call sign (signal)
against diotic² speech babble (noise).

¹ EPROM = erasable programmable read-only memory chip.

The KSC communications handbook (NASA-KSC, 1991)
lists over 3000 call signs, most of which are
spoken as four individual letters, e.g., “NTOC.”
Communication personnel who monitor multiple radio
frequencies must be able to hear these four letters clearly
against competing speech. Speech babble has been used as a
noise source in several studies investigating
binaural hearing for communication systems contexts
(e.g., Pollack and Pickett, 1958). This study concludes
with a first approximation of the answer to what HRTF 
positions are best used in the filter EPROMs within the 
prototype. 

1.2 Binaural Advantages and Speech Intelligibility 

The relationship between binaural hearing and the 
development of improved communication systems has 
been understood for over 45 years (Licklider, 1948; see 
reviews in Blauert, 1983; Zurek, 1993). As opposed to 
monotic (one ear) listening — the typical situation in 
communications operations — binaural listening allows a 
listener to use head-shadow and binaural interaction 
advantages simultaneously (Zurek, 1993). The head- 
shadow advantage is an acoustical phenomenon, caused 
by the interaural level differences that occur when a 
sound moves closer to one ear relative to the other. 
Because of the diffraction of lower frequencies around the 
head from the near ear to the far ear, only frequencies 
above approximately 1.5 kHz are shadowed in this way.
The binaural interaction advantage is a psychoacoustic 
phenomenon due to the auditory system’s comparison of 
binaurally-received signals (Levitt and Rabiner, 1967; 
Zurek, 1993). 

Many studies have focused on binaural advantages for 
both detecting a signal against noise (the binaural 
masking level difference, or BMLD) and improving 
speech intelligibility (the binaural intelligibility level 
difference, or BILD). Studies of BMLDs and BILDs 
involve manipulation of signal processing variables 
affecting either signal, noise, or both. The manipulation 
can involve phase inversion, time delay, and/or filtering. 

Recently, speech intelligibility studies by Bronkhorst and
Plomp (1988; 1992) have used a mannequin head to
impose the filtering effects of the HRTF on both signal
and noise sources. The HRTFs were used in either an
unaltered condition, or with either time or amplitude
components removed. Their results, summarized in
figure 1, show a 6 to 10 dB advantage with the signal at
0° azimuth and speech-spectrum noise moved off axis,
compared to the condition where speech and noise
originated from the same position. Figure 1 also shows
lower BILDs when either interaural time or amplitude
differences are removed from the stimuli. This suggested
the inclusion of HRTF filtering within a binaural display
for speech communication systems (Begault and
Wenzel, 1990; Begault and Wenzel, 1992). According to
a model proposed by Zurek (1993), based on averaged
HRTFs specified in Shaw and Vaillancourt (1985), the
average binaural advantage (speech signal fixed at 0°,
noise uniformly distributed across all azimuths, head free
to move) is around 5 dB, with head shadowing contributing
about 3 dB and binaural interaction about 2 dB.

² “Diotic” playback is defined as a single audio channel
presented to both ears.

Another advantage for binaural speech reception relates 
to the ability to switch voluntarily between multiple 
channels, or “streams,” of information (Bregman, 1990; 
Deutsch, 1983). The improvement in the detection of a
desired speech signal against multiple speakers,
commonly referred to as the “cocktail party effect”
(Cherry, 1953; Cherry and Taylor, 1954), is explained by
Bregman (1990) as a form of auditory stream segregation.
This situation was found to parallel the multiple channel
listening requirements of communication personnel, such
as test directors (NTDs) at KSC.

Figure 1. Data from Bronkhorst and Plomp (1988) for speech intelligibility gain. All stimuli were recorded with a mannequin
head. Speech signal fixed at 0°; noise moved along azimuth at 0° elevation. FF = data including effects of the HRTF;
dT = same data with binaural amplitude differences removed; dL = same data but with binaural time differences removed.

2. Method

2.1 Stimuli 

The signal portion of the stimulus was drawn from a list
of 130 four-letter call signs, selected from the KSC
communications handbook (NASA-KSC, 1991). The
130 call signs were selected randomly, subject to the
constraint that each letter of the alphabet began a group
of five call signs. A single male voice was used, with each
letter of the call sign spoken discontinuously over a
duration of about two seconds. Recordings took place in
a sound-proof booth, using an AKG C451-EB microphone
at a distance of 6 inches. Once digitized, each call sign
was normalized in amplitude, and then
scaled to have equal long-term r.m.s.
values.
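
The r.m.s. equalization step can be illustrated with a minimal sketch. It is not the procedure or software used in the study: the function names, the toy sample values, and the target level TARGET_RMS are all assumptions made for illustration.

```python
import math

def rms(samples):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def scale_to_rms(samples, target_rms):
    """Scale samples so that their long-term r.m.s. equals target_rms."""
    current = rms(samples)
    if current == 0.0:
        raise ValueError("cannot scale a silent recording")
    gain = target_rms / current
    return [x * gain for x in samples]

# Two hypothetical "recordings" with different levels (not real call signs).
call_sign_a = [0.5, -0.5, 0.25, -0.25]
call_sign_b = [0.1, -0.2, 0.1, -0.05]
TARGET_RMS = 0.1  # assumed reference level, not taken from the paper

equalized = [scale_to_rms(s, TARGET_RMS) for s in (call_sign_a, call_sign_b)]
```

After scaling, every recording has the same long-term r.m.s. value, so differences in threshold between trials cannot be attributed to level differences between call signs.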

The speech babble used for the noise portion of the 
stimulus consisted of multiple layers of voices: two
layers were from different airport control tower frequencies,
containing both female and male voices, with silent
intervals of more than 0.2 seconds deleted; and two additional
layers consisted of recordings of different male
voices reading technical repair manuals, one played
backwards, the other pitch shifted upwards 4 semitones.
The result was a dense speech layer in which words could 
occasionally be distinguished, but semantic content was 
lost. 

The noise and speech were digitally stored as separate 
channels of stereo sound files (fig. 2) using an Apple 
Macintosh II fic and Digidesign’s ProTool hardware and 
software. The duration of each sound file used in each
stimulus presentation was adjusted to 5 seconds, with the
noise channel faded in and out over the first and last
0.5 seconds. The signal was always presented 1.5 seconds
into the sound file, allowing subjects to predict its onset.
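
The timing of one two-channel sound file can be sketched as follows. Only the 5 second duration, the 0.5 second fades and the 1.5 second signal onset come from the text. The sample rate, the linear fade shape, and all function names are assumptions chosen to keep the example small.

```python
SAMPLE_RATE = 1000            # Hz; illustrative only, not the rate used in the study
FILE_SECONDS = 5.0            # duration of each sound file
FADE_SECONDS = 0.5            # noise fade-in and fade-out time
SIGNAL_ONSET_SECONDS = 1.5    # signal start time within the file

def fade_gain(index, total, fade_len):
    """Linear fade-in/fade-out gain (assumed shape) for one sample index."""
    if index < fade_len:
        return index / fade_len
    if index >= total - fade_len:
        return (total - 1 - index) / fade_len
    return 1.0

def build_stimulus(noise, signal):
    """Return (noise_channel, signal_channel), each FILE_SECONDS long."""
    total = int(FILE_SECONDS * SAMPLE_RATE)
    fade_len = int(FADE_SECONDS * SAMPLE_RATE)
    onset = int(SIGNAL_ONSET_SECONDS * SAMPLE_RATE)
    if len(noise) < total:
        raise ValueError("noise is shorter than the file duration")
    if onset + len(signal) > total:
        raise ValueError("signal does not fit after its onset")
    noise_channel = [noise[i] * fade_gain(i, total, fade_len) for i in range(total)]
    signal_channel = [0.0] * total
    signal_channel[onset:onset + len(signal)] = signal
    return noise_channel, signal_channel

# Constant-valued placeholders stand in for babble and a two-second call sign.
noise_ch, signal_ch = build_stimulus([1.0] * 5000, [0.5] * 2000)
```

Keeping noise and signal on separate channels, as in the stimulus files described above, lets the signal level and spatial position be changed at playback without touching the noise.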

Each of the 130 separate noise-signal sound files was 
played through signal processing software and hardware, 
using a Crystal River Engineering Convolvotron that also 
served as the experimental software host computer (see
Wenzel, 1992, for additional information on the hardware).
Upon playback, the Convolvotron passed the
speech babble channel unaltered to both ears. Mixed in
with this noise was the two-channel signal, after software
intensity scaling and HRTF-based spatialization to
azimuths at 30° increments from 30° to 330° (all
at 0° elevation). A diotic control condition was also used
for the signal, where the spatialization was bypassed and
only intensity scaling was used.
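
The playback chain described above (intensity scaling, HRTF filtering of the signal, diotic noise added to both ears) can be sketched in a few lines. This is a conceptual illustration and not the Convolvotron implementation: the direct-form convolution, the function names, and the toy impulse responses (a unit impulse for the near ear, a delayed and attenuated impulse for the far ear) are all assumptions.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def db_to_gain(level_db):
    """Convert a level change in dB to an amplitude gain factor."""
    return 10.0 ** (level_db / 20.0)

def mix(ear_signal, noise):
    """Sum an ear signal with the diotic noise, zero-padding the shorter one."""
    n = max(len(ear_signal), len(noise))
    return [
        (ear_signal[i] if i < len(ear_signal) else 0.0)
        + (noise[i] if i < len(noise) else 0.0)
        for i in range(n)
    ]

def render_trial(signal, noise, hrir_near, hrir_far, signal_level_db):
    """Scale the signal, filter it per ear, and add the same noise to both ears."""
    gain = db_to_gain(signal_level_db)
    scaled = [gain * s for s in signal]
    near_ear = mix(convolve(scaled, hrir_near), noise)
    far_ear = mix(convolve(scaled, hrir_far), noise)
    return near_ear, far_ear

# Toy impulse responses, not measured HRTF data.
near, far = render_trial(
    signal=[1.0, 0.0, 0.0, 0.0],
    noise=[0.1] * 6,
    hrir_near=[1.0],
    hrir_far=[0.0, 0.0, 0.5],
    signal_level_db=-20.0,
)
```

For the diotic control condition the two filtering steps would simply be skipped, so that the scaled signal is added unchanged to the noise in both ears.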

The minimum-phase HRTFs used for the spatialization 
were reconstructed from actual HRTF measurements as 
described in Kistler and Wightman (1992). The original 
measurements used were of one subject (SDO in 
Wightman and Kistler, 1989), with the headphone 
frequency response (Sennheiser HD-430) divided out of 
the HRTF. Although the same model of headphone was 
used for the subjects in this experiment, non-linearities in 
reproducing the HRTF were introduced as a result of the 
interaction between different pinnae and the headphone 
chambers. Data on localization error of speech with
non-individualized HRTFs can be found in Begault and
Wenzel (1991) and Begault (1992b).

Figure 2. Stimulus soundfile arrangement (1 of 130); see text.

2.2 Subjects

Five subjects (4 males, 1 female) were paid $5.59 an hour 
to participate in the study over two 3-hour sessions. This 
was the “naive subjects” group in that they had no 
exposure to the call sign list. Another group of 3 lab 
personnel (3 males) who had previous exposure to the call 
sign list constituted the “experienced subjects” group; 
their data is analyzed separately from the naive subject 
group. This group included a subject whose voice was 
used in the signal. 

All subjects were evaluated for normal hearing from 
0.1-8 kHz in a pure tone audiometer test. Subjects were 
given a training session before starting the experiment to
familiarize themselves with the computer, when
to expect the signal in relation to the noise, and the
procedure for entering responses. This training session
consisted of a dummy block where the level of the signal 
was clearly audible against the noise, and was never 
scaled. The formal blocks were begun after approximately 
20 trials. 

2.3 Procedure 

Software was developed by Phil Stone (Sterling 
Software) for presenting stimuli and gathering data from 
subjects using an interleaved, transformed up-down 
“staircase” method (Levitt, 1970). The software varied 
the level of the signal against the noise, starting with a 
maximum stepsize interval of 6 dB, and decreasing to a 
minimum stepsize of 1 dB. The response sequences were 
evaluated in such a way as to determine the threshold at a 
70.7% probability level (a “2 up, 1 down” procedure). 
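
A single transformed up-down track of the kind described by Levitt (1970) can be sketched as below. It is not the software used in the study. The rule assumed here lowers the signal level after two consecutive correct responses and raises it after one incorrect response, which converges on the level where the probability p of a correct response satisfies p × p = 0.5, i.e., p = 0.707. Halving the step size at each reversal and averaging the reversal levels are assumptions; the text specifies only the 6 dB maximum and 1 dB minimum step sizes.

```python
class Staircase:
    """One adaptive track targeting the 70.7% correct point."""

    def __init__(self, start_db, max_step_db=6.0, min_step_db=1.0):
        self.level_db = start_db
        self.step_db = max_step_db
        self.min_step_db = min_step_db
        self.correct_run = 0      # consecutive correct responses so far
        self.last_direction = 0   # -1 = level went down, +1 = level went up
        self.reversals = []       # levels at which the direction changed

    def update(self, correct):
        """Record one response and return the level for the next trial."""
        if correct:
            self.correct_run += 1
            if self.correct_run < 2:
                return self.level_db       # wait for a second correct response
            self.correct_run = 0
            direction = -1                 # two correct: make the task harder
        else:
            self.correct_run = 0
            direction = +1                 # one incorrect: make the task easier
        if self.last_direction and direction != self.last_direction:
            self.reversals.append(self.level_db)
            # Assumed schedule: halve the step at each reversal, down to the minimum.
            self.step_db = max(self.min_step_db, self.step_db / 2.0)
        self.last_direction = direction
        self.level_db += direction * self.step_db
        return self.level_db

    def threshold(self):
        """Assumed estimate: mean of the recorded reversal levels."""
        return sum(self.reversals) / len(self.reversals)

track = Staircase(start_db=0.0)
levels = [track.update(r) for r in (True, True, False, True, True, False)]
```

Interleaving, as used in the experiment, would amount to keeping several such tracks (one per condition) and choosing at random which one supplies the next trial.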

The decibel levels of the diotic stimuli and the
spatialized stimuli were considered to be equal with
reference to the long-term r.m.s. value of speech-spectrum
noise filtered by a left ear 0° HRTF (obtained
from the same HRTF set used for the other spatialized
positions). The playback level was around 55 dB SPL
when the noise and 0° HRTF-filtered calibration signals
were played simultaneously.

Six blocks were administered to each subject over three
or four days, with each block containing four staircases
randomly chosen from the eleven possible spatial
positions or the one diotic signal condition. The four
staircases within each block were presented randomly, as
were the 130 call sign-speech babble sound files used for
a particular stimulus block. The staircases within the
blocks were arranged so that ten threshold values were
obtained from each subject for each spatial condition, and
the diotic condition. No block contained two simultaneous
staircases for the same spatial condition of the
signal.

Upon hearing the stimulus, the subject typed the four
letters they thought they had heard on a computer
keyboard, and after a short pause the software
presented the next trial. Each block of four staircases
took about 15-20 minutes to complete. Testing was
administered in a sound-proof booth. No feedback was
given as to the correct identification of the call signs; the
subjects were only notified when the 20 staircases within
a particular block (four spatial conditions times five
staircases) were completed.

3. Results 

Figure 3 summarizes the data for the five naive subjects,
and figure 4 summarizes the data for the three experienced
subjects. Before the data were grouped, each individual
subject’s threshold for the diotic signal vs. diotic speech
babble condition was subtracted from that subject’s
thresholds; the mean values for each position were then
obtained. The results in figures 3
and 4 show a greater intelligibility advantage as the signal
is moved to either side of the head; the advantage is
maximal between 60°-90° and 270°-300°. These are
locations where head-shadowing is maximized and
where the binaural interaction advantage mechanism is
given maximal time differences.
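
The normalization to each subject's diotic threshold can be sketched as follows. The threshold values are invented purely to show the arithmetic and are not data from the study. The sign convention (advantage = diotic threshold minus spatialized threshold, so that a positive number means the signal could be identified at a lower level) and the function names are assumptions.

```python
def advantage_re_diotic(thresholds_db):
    """Per-subject advantage of each spatial condition over the diotic one."""
    diotic = thresholds_db["diotic"]
    return {
        condition: diotic - value
        for condition, value in thresholds_db.items()
        if condition != "diotic"
    }

def group_mean(per_subject):
    """Mean across subjects for every condition."""
    conditions = per_subject[0].keys()
    return {c: sum(s[c] for s in per_subject) / len(per_subject) for c in conditions}

# Hypothetical thresholds in dB (signal level re noise); keys are azimuths in degrees.
subjects = [
    {"diotic": -2.0, 60: -8.0, 180: -3.0},
    {"diotic": -4.0, 60: -11.0, 180: -4.0},
]
advantages = [advantage_re_diotic(s) for s in subjects]
mean_advantage = group_mean(advantages)
```

Subtracting within each subject before averaging removes differences in overall sensitivity between listeners, so the group mean reflects only the effect of spatial position.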

Figure 5 summarizes figures 3 and 4, by showing the 
mean values for symmetrical left-right positions about the 
head. This suggests, without reference to which side a 
sound is spatialized, that the preferred order for HRTF- 
processing for maximal intelligibility is 60° or 90°, then 
120°, then 30°, then 150°, and finally 180°. The latter is 
hardly better than performance with the diotic stimuli. 
Figure 5 also shows that the three experienced subjects 
achieved about a 1 dB additional intelligibility advantage 
over the five naive subjects. However, an analysis of 
variance revealed that no significant difference existed 
between these two subject categories, F(1,6) = 2.90,
p = 0.14. 
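
Collapsing the data about symmetrical left-right positions, as in figure 5, amounts to averaging each azimuth with its mirror image at 360° minus that azimuth. The sketch below assumes this pairing; the function name and the input values are hypothetical and do not reproduce the measured data.

```python
def collapse_left_right(advantage_by_azimuth):
    """Average each azimuth with its mirror position (360 minus the azimuth)."""
    grouped = {}
    for azimuth, value in advantage_by_azimuth.items():
        mirrored = min(azimuth, 360 - azimuth)   # 330 maps to 30, 270 to 90
        grouped.setdefault(mirrored, []).append(value)
    return {az: sum(vals) / len(vals) for az, vals in sorted(grouped.items())}

# Invented advantages in dB, keyed by azimuth in degrees.
collapsed = collapse_left_right({30: 2.0, 330: 4.0, 90: 6.0, 270: 7.0, 180: 1.0})
```

The 180° position has no mirror image and is carried through unchanged.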

The mean values for four of the naive subjects had a 
pattern that followed the symmetrical trend of the overall 
mean shown in figure 3; there seemed to be no preferred 
side to hear the signal. Contrasting this, the responses of 
one of the naive subjects had an asymmetrical trend, 
favoring right side positions over left side positions. This
trend was similar to that of a potential subject whose data were
excluded from the subject pool and the analysis above
due to hearing loss at the left ear (between 20-35 dB HL
at 4, 6, 8, and 12 kHz).

Figure 3. Data for the naive subject group (4 males, 1 female). The mean value for the diotic signal condition was
subtracted from each spatialized signal value. Standard deviation bars were based on the 10 staircase solutions obtained
for each condition. Horizontal axis: azimuth of spatialized signal.

Figure 4. Data for the experienced subject group (see fig. 3). Horizontal axis: azimuth of spatialized signal.

Figure 5. Mean values from figures 3 and 4 collapsed about symmetrical left-right positions. Horizontal axis: azimuth of
spatialized signal (mean of left and right sides). Filled squares: naive subjects; open circles: experienced subjects.


Figure 6 shows the results for these two subjects, along 
with the overall means from the naive subject group. 
Except for the 60° azimuth position, both of these 
subjects have a lesser advantage for left side positions 
compared to the overall mean, and right side positions 
show a greater advantage. Additional data would be 
needed to determine if there was a significant effect due 
to handedness or other factors (Deutsch, 1983). 
Nevertheless, a person with asymmetrical hearing loss,
similar to that experienced by the subject shown in
figure 6, could still benefit from using a 3-D auditory
display. Gabriel, Koehnke and Colburn (1991) and
Perrott, Sadralodabi, Saberi and Strybel (1991) have
pointed out that, excluding severe hearing loss, no
apparent relation between audiometric measurements
and binaural performance can be established.

Figure 6. Two subjects (one from the naive group, one subject with asymmetrical hearing loss) who tended to favor the right
side positions over the left. Overall means (from fig. 3) shown for comparison. Horizontal axis: azimuth of spatialized
signal. Open circles: subject with normal hearing; crosses: subject with asymmetrical hearing loss; filled squares:
overall mean for naive subject group.


4. Discussion 

Overall, a 6-7 dB advantage for left and right 60° and 90° 
positions was found in the present study, which exceeds 
the binaural advantage cited in Zurek’s model (1993) by 
1-2 dB. This means that headphone listening with static
spatial positions through the hardware prototype is at least
as effective as binaural listening by a normal-hearing listener who is free
to move their head. Although Bronkhorst and Plomp
(1988) found a 10 dB advantage for a signal at 0° azimuth 
and speech-spectrum noise at 90°, their results are not 
directly comparable to those found here since both signal 
and noise were HRTF-filtered by their mannequin head, 
and in the present study the noise portion of the stimulus 
was diotic. The additional release from masking they
found may have been attained through either HRTF-filtering
of both signal and noise, the use of noise rather
than speech babble, or both.

The results found here are limited by the fact that only 
one male speaker was used for the signal portion of the 
stimulus. In spite of the care taken in preparing the 
stimulus through digital editing, there is the potential 
that extraneous variation was introduced into the results 
because of the variability of spoken intelligibility (ANSI, 
1989). Furthermore, the average spectrum of this particular
speaker might have interacted differently with the
HRTF filtering than that of another speaker (e.g., a female
voice). Finally, the variability in HRTF measurements
from different persons or reconstruction techniques could
influence the results of any experiment that uses only one
set of HRTFs. This is one reason the prototype was 
designed to allow interchangeable EPROMs: individuals 
could tailor systems to their best advantage by using a 
preferred set of HRTFs. 

5. Conclusion 

The advantage of a binaural auditory display for multiple 
communication channels has been demonstrated, through 
a case study of a single signal at incremented 30° azimuth 
positions against a diotic, speech babble noise source. The 
6-7 dB advantage for 60° and 90° HRTF-filtered speech
represents roughly a halving of the signal amplitude (about
one quarter of the acoustic power)
necessary for correctly identifying four-letter call signs
typical of those used in communication systems at KSC.
This reduction in the likelihood of misinterpreting call 
signs over communication systems is an important safety 
improvement for “high stress,” human-machine interface 
contexts. The binaural advantage could also benefit 
communications personnel because the overall intensity 
of communications hardware could be reduced without 
sacrificing intelligibility. Lower listening levels over 
headphones could possibly reduce the risk of threshold 
shifts, the Lombard Reflex (raising the intensity of one’s 
own voice; see Junqua, 1993), and overall fatigue, 
thereby making additional contributions to safety. 
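
As a check on the decibel arithmetic behind a 6-7 dB advantage, the standard conversions between a level change and the corresponding power and amplitude ratios can be written out. The function names are illustrative; the formulas are the usual definitions (10 log10 of a power ratio, 20 log10 of an amplitude ratio).

```python
def power_ratio(level_change_db):
    """Ratio of acoustic powers corresponding to a level change in dB."""
    return 10.0 ** (level_change_db / 10.0)

def amplitude_ratio(level_change_db):
    """Ratio of sound pressure amplitudes corresponding to a level change in dB."""
    return 10.0 ** (level_change_db / 20.0)

# Signal level reductions that leave intelligibility unchanged.
for reduction_db in (-3.0, -6.0, -7.0):
    print(
        f"{reduction_db:5.1f} dB: "
        f"power x {power_ratio(reduction_db):.3f}, "
        f"amplitude x {amplitude_ratio(reduction_db):.3f}"
    )
```

A 3 dB reduction halves the power, while a 6 dB reduction halves the amplitude and reduces the power to about one quarter.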

Overall, the findings here suggest that the use of a spatial 
auditory display could enhance both occupational and 
operational safety and efficiency of NASA operations. 
Additional studies are underway at Ames to simulate 
other applications scenarios within speech intelligibility 
experiments to determine the additional benefits, if any, 
of spatial audio communications displays. 